A new initialization method for categorical data clustering

Authors

  • Fuyuan Cao
  • Jiye Liang
  • Liang Bai
Abstract

In clustering algorithms, choosing a subset of representative examples from the data set is very important. Such "exemplars" can be found by randomly choosing an initial subset of data objects and then iteratively refining it, but this works well only if that initial choice is close to a good solution. In this paper, the average density of an object is defined based on the frequency of attribute values. Furthermore, a novel initialization method for categorical data is proposed, in which both the distance between objects and the density of each object are considered. We also apply the proposed initialization method to the k-modes algorithm and the fuzzy k-modes algorithm. Experimental results illustrate that the proposed initialization method is superior to random initialization and, because its time complexity is linear in the number of data objects, can be applied to large data sets.

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common method is to use data to learn a set of centers such that the sum of squared errors between objects and their nearest centers is small (Brendan & Delbert, 2007). At present, popular partitional clustering techniques usually begin with an initial set of randomly selected exemplars and iteratively refine this set so as to decrease the sum of squared errors. Because of its simplicity, random initialization has been widely used. However, these clustering algorithms need to be rerun many times with different initializations in an attempt to find a good solution. Furthermore, random initialization works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. Therefore, how to choose initial cluster centers is extremely important, as they have a direct impact on the formation of the final clusters.
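The density-and-distance idea described in the abstract can be illustrated with a minimal sketch. The excerpt does not give the exact formulas, so everything specific here is an assumption: density is taken as the mean relative frequency of an object's attribute values, distance is simple matching (count of differing attributes), and each further center maximizes the product of density and distance to the nearest already chosen center. The function names are hypothetical and this is not presented as the authors' implementation.

```python
from collections import Counter


def matching_distance(x, y):
    """Simple matching dissimilarity: number of attributes where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def average_density(data):
    """Assumed definition: mean relative frequency of each object's attribute values."""
    n = len(data)
    m = len(data[0])
    # One frequency table per attribute (column).
    counts = [Counter(row[j] for row in data) for j in range(m)]
    return [sum(counts[j][row[j]] for j in range(m)) / (m * n) for row in data]


def init_centers(data, k):
    """Pick k initial centers, favoring dense objects far from chosen centers."""
    dens = average_density(data)
    # Assumption: the first center is the object with the highest density.
    chosen = [max(range(len(data)), key=lambda i: dens[i])]
    while len(chosen) < k:
        def score(i):
            # Distance to the nearest chosen center, weighted by density.
            nearest = min(matching_distance(data[i], data[c]) for c in chosen)
            return nearest * dens[i]

        candidates = (i for i in range(len(data)) if i not in chosen)
        chosen.append(max(candidates, key=score))
    return [data[i] for i in chosen]


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
centers = init_centers(data, 2)
```

For a fixed number of clusters and attributes, this sketch makes one pass over the objects to build the frequency tables and one pass per selected center, so its cost grows linearly with the number of objects, in line with the complexity claim in the abstract.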
Depending on the data type, methods for selecting initial cluster centers can be classified mainly into those for numeric data and those for categorical data. For numeric data, several attempts to solve the cluster initialization problem have been reported; to date, however, little research has addressed the initialization of categorical data. Huang (1998) introduced two initial mode selection methods for the k-modes algorithm. The first method selects the first k distinct objects from the data set as the initial k modes. The second method assigns the most frequent categories equally to the initial k modes. Though the second method …
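The first of the two selection methods attributed to Huang (1998), taking the first k distinct objects as initial modes, can be sketched as below. The function name and the error handling are assumptions made for illustration; the second method is not sketched because the excerpt breaks off before describing it fully.

```python
def first_k_distinct(data, k):
    """Return the first k distinct objects of data, in order of appearance."""
    seen = set()
    modes = []
    for row in data:
        key = tuple(row)  # tuples are hashable, so duplicates can be detected
        if key not in seen:
            seen.add(key)
            modes.append(key)
            if len(modes) == k:
                return modes
    # Assumption: fewer than k distinct objects is treated as an error.
    raise ValueError("data set contains fewer than k distinct objects")


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
initial_modes = first_k_distinct(data, 2)
```

Because the result depends only on the order of the objects, reordering the data set can change the initial modes.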

Similar resources

Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering

The k-modes algorithm has become a popular technique for solving categorical data clustering problems in different application domains. However, the algorithm requires random selection of initial points for the clusters. Different initial points often lead to considerably different clustering results. In this paper we present an experimental study on applying a farthest-point heuristic based init...

Presenting a clustering algorithm for categorical data by combining measures

Clustering is one of the main techniques in data mining. Clustering is a process that partitions a data set into groups. In clustering, the data within a cluster are as close to each other as possible, and the data in different clusters differ as much as possible. Clustering algorithms are divided into two categories according to the type of data: clustering algorithms for numerical data and clustering algor...

Cluster center initialization algorithm for K-modes clustering

Partitional clustering of categorical data is normally performed using the K-modes clustering algorithm, which works well for large datasets. Even though the design and implementation of the K-modes algorithm is simple and efficient, it has the pitfall of randomly choosing the initial cluster centers on every new execution, which may lead to non-repeatable clustering results. This paper addr...

A cluster centers initialization method for clustering categorical data

Keywords: The k-modes algorithm; Initialization method; Initial cluster centers; Density; Distance. Abstract: The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the performance of the k-modes clustering algorithm, which converges to numerous local minima, strongly depends on initial cluster centers...

Clustering Categorical Data Using Community Detection Techniques

With the advent of the k-modes algorithm, the toolbox for clustering categorical data has an efficient tool that scales linearly in the number of data items. However, random initialization of cluster centers in k-modes makes it hard to reach a good clustering without resorting to many trials. Recently proposed methods for better initialization are deterministic and reduce the clustering cost co...

Numerical and Categorical Attributes Data Clustering Using K-Modes and Fuzzy K-Modes

Most of the existing clustering approaches are applicable to purely numerical or purely categorical data only, but not both. In general, it is a nontrivial task to perform clustering on mixed data composed of numerical and categorical attributes because there exists an awkward gap between the similarity metrics for categorical and numerical data. This paper therefore presents a general clustering ...

Journal:
  • Expert Syst. Appl.

Volume: 36

Publication date: 2009